  • Team Members
  • Problem Statement
  • Data Description
  • Data Preprocessing
  • Statistical Approaches
  • Results
  • Discussion
  • References

Team Members

  • Eugene Gato Nsengamungu
  • Ernesto Ye Luo
  • Kathleen Cimino

Problem Statement

This analysis explores whether trans people experience discrimination differently based on factors such as their socioeconomic status and gender identity. The problem is two-fold. First, are there any meaningful clusters among the trans respondents to this survey based on demographic information? Second, do these clusters of people experience these forms of discrimination differently?

Traditionally, trans people have been treated either as a homogeneous group or as groups split according to assigned gender at birth (AGAB) and gender identity. However, these might not be the best ways to group trans people given the wide variety of social situations trans people exist in. While some research has started to acknowledge the heterogeneity in the trans community and account for it (Waldman 2018; Tatum et al. 2020; Bakko and Kattari 2020; Downing et al. 2022; Motmans et al. 2012; Bradford and Catalpa 2019; Matsuno and Budge 2017; Harrison et al. 2012), few if any studies attempt to understand the heterogeneity itself. Additionally, there is cause to believe that trans people’s experiences of discrimination differ based on their gender identity (James et al. 2015; Veale et al. 2019), so it is worth exploring whether this holds after attempting to cluster on other factors. This exploration of the experiences of trans people can potentially provide some guidance for areas of future research and help ensure that any policies attempting to address transphobia will help all trans people rather than a select few.

Data Description

The TransPop dataset is the first national probability sample of transgender individuals in the United States and was conducted by researchers at the Williams Institute at UCLA School of Law, Columbia University, Harvard University, and The Fenway Institute at Fenway Health. It can be accessed through ICPSR at https://www.icpsr.umich.edu/web/DSDR/studies/37938. Participants were recruited through and screened by Gallup to only include individuals who identify as transgender. The data was collected in two periods, from April 2016 to August 2016 and from June 2017 to December 2018, with a final sample size of 274. The survey was conducted online, over the phone, and through a mail questionnaire. Data was collected using the TransPop survey, which included questions on demographic information, community characteristics, sexual attraction and activity, others’ perception of the respondent’s gender, interest in and access to gender-affirming treatment, health care access and general health status, mental health, violence victimization, discrimination, employment, housing, and other general questions about the transgender experience. The survey also includes several validated scales, in both imputed and non-imputed versions, which are designed to measure constructs related to identity, stress, and health.

The variables of interest for this analysis are those covering demographic information and experiences of discrimination. Within this subset, the only true interval variable is age; all of the other variables are nominal or ordinal. Age is bimodal and right skewed with a median of 34.

Data Preprocessing

Questions 163, 166, 168, and 202 were recoded as binary variables, where 1 is present and 0 is not present, because NA is a meaningful option for these questions. All variables that were either binary or on a Likert scale (or a similar scale, most commonly one asking if something had happened never, once, twice, three times, or four or more times) were converted to numeric values. Observations with missing values were omitted because there were only 33 (12% of the data), and imputation would be too computationally expensive given the number of variables.

The variables used for clustering were standardized if they were interval data, and all of the categorical data was already coded as factors.
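The preprocessing steps above can be sketched in base R. This is a minimal illustration on made-up data, not the code used for the analysis: the data frame `raw` and the column names `q163`, `q_freq`, and `age` are placeholders, and the response labels are assumed.

```r
# Toy data standing in for the survey; all names and values are hypothetical.
raw <- data.frame(
  q163   = c("(1) Yes", NA, "(1) Yes", NA),
  q_freq = factor(c("Never", "Twice", "Once", NA),
                  levels = c("Never", "Once", "Twice",
                             "Three times", "Four or more times")),
  age    = c(23, 41, NA, 35)
)

clean <- raw

# NA is a meaningful "not present" answer for this item, so recode to 1/0.
clean$q163 <- ifelse(is.na(raw$q163), 0L, 1L)

# Ordered response scale -> integer codes (1 = Never, ..., 5 = Four or more).
clean$q_freq <- as.integer(raw$q_freq)

# Listwise deletion of any row that still has a missing value.
clean <- na.omit(clean)

# Standardize the interval variable (mean 0, standard deviation 1).
clean$age_z <- as.numeric(scale(clean$age))

print(clean)
```

Note that the binary recode has to happen before `na.omit()`; otherwise the rows where NA is a legitimate answer would be dropped along with the truly missing ones.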

Statistical Approaches

For this analysis, we used the following statistical approaches:

Principal Component Analysis (PCA):

PCA is a data reduction technique that seeks to reduce the dimensionality of the variables in a dataset for the purpose of increasing interpretability while minimizing information loss.

Specifically, it replaces a large number of correlated variables with a smaller set of composite variables called principal components (PCs), each of which is a linear combination of the original variables. The first PC is the direction (line) through the data along which the variance is highest; variables that are strongly correlated with each other tend to contribute to the same PC. The second PC is the direction perpendicular to the first that explains the most of the remaining variance, and each subsequent PC is found the same way, perpendicular to all of the previous ones. This process continues until the PCs together explain 100% of the variance in the dataset, which takes at most as many PCs as there are variables. The first few PCs usually explain most of the variance, so PCA reduces dimensionality by keeping only those PCs: many variables are replaced with a few PCs that explain enough of the variance in those variables.
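As a small illustration of these ideas, the sketch below runs PCA with base R's `prcomp()` on the built-in `USArrests` data. This is an assumption made for the example only; the analysis in this report used a different function and the TransPop variables.

```r
# Built-in example data, used only to illustrate the mechanics of PCA.
X <- USArrests

# center/scale. standardize each variable before extracting components.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Loadings: the weights of each original variable in each linear combination.
print(round(pca$rotation, 2))

# Proportion of total variance explained by each PC, and the running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)
print(round(rbind(var_explained, cum_var), 3))

# The PC directions are mutually perpendicular: t(R) %*% R is the identity.
orthogonality_check <- crossprod(pca$rotation)
print(round(orthogonality_check, 10))
```

The last PC always brings the cumulative proportion to 1, which is why the full set of components carries no dimensionality reduction by itself; the reduction comes from keeping only the leading components.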

The number of PCs to retain can be determined by three main approaches: 1) a scree plot, 2) the Kaiser Rule, and 3) selecting the number of PCs that explain at least 80% of the cumulative variance of a dataset. The scree plot is a line graph with the component number on the x-axis and the corresponding eigenvalue on the y-axis. Typically, there is an “elbow point” in the plot, the point where the line bends sharply and then flattens out, which indicates how many PCs to retain. The Kaiser Rule states that one should retain the PCs that have an eigenvalue greater than 1. Since a PC’s eigenvalue is equal to the amount of variance it explains, and each standardized variable contributes a variance of 1, a PC with an eigenvalue greater than 1 explains more variance than any single original variable. Lastly, the third approach is to retain the smallest number of PCs that together explain at least 80% of the cumulative variance of a dataset.
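The three rules can be compared side by side with a few lines of base R. The sketch below again uses the built-in `USArrests` data as a stand-in (an assumption for illustration); the eigenvalues of the correlation matrix are the variances of the PCs of the standardized variables.

```r
# Example data only; not the survey data analyzed in this report.
X <- USArrests

# Eigenvalues of the correlation matrix = variance explained by each PC.
eig <- eigen(cor(X))$values

# Kaiser Rule: keep components with an eigenvalue greater than 1.
n_kaiser <- sum(eig > 1)

# 80% rule: smallest number of components reaching 80% cumulative variance.
cum_var <- cumsum(eig) / sum(eig)
n_80 <- which(cum_var >= 0.80)[1]

# Scree plot: component number on the x-axis, eigenvalue on the y-axis.
plot(seq_along(eig), eig, type = "b",
     xlab = "Component number", ylab = "Eigenvalue", main = "Scree plot")
abline(h = 1, lty = 2)  # reference line for the Kaiser Rule

cat("Kaiser Rule:", n_kaiser, "component(s); 80% rule:", n_80, "component(s)\n")
```

The rules do not have to agree, which is one reason the results below report solutions with several different numbers of components.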

To conduct PCA, all variables must first be numeric and standardized. Standardizing refers to transforming all variables onto the same scale, so that variables measured in larger units do not dominate the components simply because their variances are larger. This is an important step for ensuring greater validity in PCA.

Exploratory Factor Analysis (EFA):

EFA is a data reduction technique that seeks to uncover latent variables that explain the relationship of all similar variables in a dataset. It can be used for acquiring greater insight on the underlying relationships between measured variables.

EFA creates representations of the relationships between similar variables by organizing the most correlated variables into groups called factors. The variables with the highest correlation among each other are grouped together into one factor. The ones that are not grouped together in the previous factor will be grouped in the subsequent factors, using the same method of finding the most highly correlated variables among each other.

Once EFA creates factors of variables in a dataset, it is up to the researcher to name those factors. For instance, in our analysis, EFA grouped highly correlated variables, such as whether the participant felt that they were prevented from moving into or buying a house or apartment due to their gender, race, religion, physical appearance, etc., into a factor that we called housing discrimination. By grouping the variables that are most highly correlated with each other, EFA can help researchers gain greater insight into the underlying relationships between measured variables.

The number of factors to retain in EFA is the smallest number of factors that explains an adequate amount of the variance in the variables in the dataset. As with PCA, this number can be chosen mainly through a scree plot, the Kaiser Rule, or selecting the number of factors that explain at least 80% of the cumulative variance of a dataset.
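To make the grouping behavior concrete, here is a minimal sketch using base R's `factanal()` on simulated data. Both the function choice and the data are assumptions for illustration: the six items `a1` to `b3` are generated from two hypothetical latent variables, so the expected grouping is known in advance.

```r
set.seed(1)
n <- 300

# Two independent latent variables (hypothetical constructs).
f1 <- rnorm(n)
f2 <- rnorm(n)

# Six observed items: a1-a3 are driven by f1, b1-b3 by f2, plus noise.
items <- data.frame(
  a1 = 0.8 * f1 + rnorm(n, sd = 0.5),
  a2 = 0.7 * f1 + rnorm(n, sd = 0.5),
  a3 = 0.9 * f1 + rnorm(n, sd = 0.5),
  b1 = 0.8 * f2 + rnorm(n, sd = 0.5),
  b2 = 0.7 * f2 + rnorm(n, sd = 0.5),
  b3 = 0.9 * f2 + rnorm(n, sd = 0.5)
)

# Maximum-likelihood factor analysis with a varimax rotation.
efa <- factanal(items, factors = 2, rotation = "varimax")

# Hide small loadings so the grouping of items into factors is easier to read.
print(efa$loadings, cutoff = 0.3)

# For each item, which factor carries its largest absolute loading?
dominant <- apply(abs(unclass(efa$loadings)), 1, which.max)
print(dominant)
```

In real survey data the structure is rarely this clean, and naming each factor requires reading the wording of the items that load on it.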

Cluster Analysis:

Cluster Analysis is a technique that is designed to uncover subgroups of observations in a dataset through finding clusters of observations that are most similar to each other. There are many types of cluster analysis; they vary by the methods used to conduct the analysis. The methods we used are Partitioning around Medoids (PAM) using the Gower distance and t-distributed stochastic neighbor embedding (t-SNE).

We used Gower distance for our cluster analysis because it can handle a mix of categorical and numeric variables in a dataset. It does this by calculating the dissimilarity between two observations one variable at a time. For each numeric variable, the dissimilarity between the observations is the absolute difference between their values, divided by the range of that variable. For each categorical variable, the dissimilarity between the observations is 0 (the values are equal) or 1 (the values are different). For instance, if two observations both have the value “Female” for their sex variable, the dissimilarity is 0. In contrast, if one observation has the value “Female” and the other has the value “Male,” the dissimilarity is 1. Lastly, Gower distance adds all of these dissimilarity values and divides the result by the number of variables used to calculate them.
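The calculation described above is simple enough to write out by hand. The sketch below is an illustrative base R implementation for a single pair of observations; the helper `gower_pair()` and the three-person data frame are hypothetical and exist only for this example.

```r
# Gower dissimilarity between two one-row data frames x and y.
# `ranges` is a named numeric vector giving the range of each numeric column.
gower_pair <- function(x, y, ranges) {
  per_variable <- vapply(names(x), function(v) {
    if (is.numeric(x[[v]])) {
      abs(x[[v]] - y[[v]]) / ranges[[v]]                        # scaled difference
    } else {
      as.numeric(as.character(x[[v]]) != as.character(y[[v]]))  # 0 if equal, 1 if not
    }
  }, numeric(1))
  mean(per_variable)  # sum of dissimilarities / number of variables
}

# Made-up observations (not survey data).
people <- data.frame(
  age    = c(20, 30, 60),
  sex    = factor(c("Female", "Female", "Male")),
  region = factor(c("South", "West", "South"))
)
ranges <- c(age = diff(range(people$age)))  # 60 - 20 = 40

d12 <- gower_pair(people[1, ], people[2, ], ranges)  # (10/40 + 0 + 1) / 3
d13 <- gower_pair(people[1, ], people[3, ], ranges)  # (40/40 + 1 + 0) / 3
print(c(d12 = d12, d13 = d13))

# Optional cross-check against the cluster package, if it is installed.
if (requireNamespace("cluster", quietly = TRUE)) {
  print(cluster::daisy(people, metric = "gower"))
}
```

Because every per-variable dissimilarity lies between 0 and 1, the resulting distance also lies between 0 and 1, regardless of how many numeric or categorical variables are mixed together.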

PAM is a form of cluster analysis that represents each cluster by one of its own observations (a medoid) rather than by a computed center point that might not correspond to any observation. This makes it more robust to outliers, and because it works from a dissimilarity matrix, it can be used on categorical data. Given that our data was predominantly nominal and ordinal, and that there was some risk of outliers, we decided to use this method. To determine the number of clusters to use in our PAM cluster analysis, we used a silhouette plot. The silhouette method computes a silhouette coefficient for each observation that measures how similar the observation is to its own cluster compared to the other clusters. These coefficients are then averaged over all observations to get the silhouette score for that clustering, which can range from -1 to 1. The silhouette plot graphs the silhouette score for each candidate number of clusters. A higher silhouette score indicates clusters that are more cohesive and better separated: a score of -1 is the least cohesive and a score of 1 is the most cohesive. Thus, when choosing the number of clusters using the silhouette plot, we should choose the number of clusters that has the highest silhouette score.
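A minimal sketch of this procedure, using the `cluster` package on simulated mixed data, is shown below. The data frame `toy` and its two variables are made up for illustration, and the range of candidate cluster counts (2 to 6) is an arbitrary assumption.

```r
library(cluster)  # provides daisy() and pam()

set.seed(42)

# Simulated mixed data with two obvious groups (not survey data).
toy <- data.frame(
  age   = c(rnorm(30, mean = 25, sd = 3), rnorm(30, mean = 55, sd = 3)),
  group = factor(rep(c("a", "b"), each = 30))
)

# Gower dissimilarities between all pairs of observations.
d <- daisy(toy, metric = "gower")

# Average silhouette width for each candidate number of clusters.
ks <- 2:6
avg_sil <- sapply(ks, function(k) pam(d, k = k, diss = TRUE)$silinfo$avg.width)

# Silhouette plot: number of clusters against average silhouette width.
plot(ks, avg_sil, type = "b",
     xlab = "Number of clusters", ylab = "Average silhouette width")

# Refit with the number of clusters that maximizes the silhouette score.
best_k <- ks[which.max(avg_sil)]
fit <- pam(d, k = best_k, diss = TRUE)

print(fit$id.med)             # row indices of the medoid observations
print(table(fit$clustering))  # cluster sizes
```

The medoids returned by `pam()` are actual rows of the data, so each cluster can be described by looking at a real respondent's answers rather than at an averaged profile.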

We used t-SNE to visualize our clusters because it projects high-dimensional data into two dimensions while preserving local structure, and it can account for non-linear relationships between variables. The problem with visualizing high-dimensional data is that the simplification can discard information that is characteristic of the dataset. t-SNE addresses this by trying to keep observations that are close together in the original data close together in the plot, so neighborhoods of similar observations remain visible. Distances between points that are far apart, including distances between separate clusters, are not reliably preserved, so the plot is best read for whether groups separate rather than for how far apart they are. Additionally, because t-SNE is not restricted to linear projections, it can represent non-linear relationships between variables that a linear method such as PCA would miss.
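One way to produce such a plot from a Gower dissimilarity matrix is with the `Rtsne` package, sketched below on simulated data. This is an assumption about tooling, not a record of the code used here; the data frame `toy` and the perplexity value of 15 are made up, and the block does nothing if `Rtsne` or `cluster` is not installed.

```r
emb <- NULL  # stays NULL if the optional packages are unavailable

if (requireNamespace("Rtsne", quietly = TRUE) &&
    requireNamespace("cluster", quietly = TRUE)) {
  set.seed(42)

  # Simulated mixed data with two groups (not survey data).
  toy <- data.frame(
    age   = c(rnorm(30, mean = 25, sd = 3), rnorm(30, mean = 55, sd = 3)),
    group = factor(rep(c("a", "b"), each = 30))
  )

  # Gower dissimilarities, then PAM clusters to color the points.
  d <- cluster::daisy(toy, metric = "gower")
  clusters <- cluster::pam(d, k = 2, diss = TRUE)$clustering

  # is_distance = TRUE tells Rtsne the input is a distance matrix, not raw data.
  emb <- Rtsne::Rtsne(as.matrix(d), dims = 2, perplexity = 15,
                      is_distance = TRUE)

  # Each point is one observation; color shows its cluster assignment.
  plot(emb$Y, col = clusters, pch = 19,
       xlab = "t-SNE dimension 1", ylab = "t-SNE dimension 2")
}
```

The perplexity setting controls roughly how many neighbors each point attends to, and different values can change the picture noticeably, so it is worth trying several before interpreting a plot.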

Results

Principal Components Analysis

This analysis attempted to use principal component analysis to reduce the fairly high dimensionality of the dataset, so as to make our cluster analysis easier and more interpretable. For the principal components analysis, several solutions were attempted. Based on the scree plot, solutions using two, three, and four components were attempted. When those failed, eight components were used based on the intersection of the real and simulated eigenvalues. Solutions using fifteen and twenty-three components were also attempted, based respectively on the Kaiser rule and on selecting the smallest number of PCs that explains at least 80% of the variance of the dataset. However, for all of these solutions all variables loaded somewhat strongly on the first component, and there was no clear pattern to the loadings on the other components. The graph below showing the component loadings for the eight component solution is representative of the sort of loadings found by all of these solutions.

fit2 <- PCA(discgen, nfactors=3)
plot(fit2, sort=TRUE)

Exploratory Factor Analysis

For the exploratory factor analysis, two solutions were attempted. First, an exploratory factor analysis was conducted on all of the variables related to discrimination. As can be seen in the graph below, the factor analysis did find three factors (though they are not as nicely separated as might be desired). However, after looking at the corresponding questions, these results were not especially intelligible. To attempt to fix this, the variables were divided into variables dealing with general discrimination and variables dealing with trans-specific discrimination. After using scree plots again, three factors were fitted to each of these subsets. As can be seen in the graphs below, variables tended to load nicely onto the factors. Looking at what these variables were measuring, the factors were fairly intelligible. For the factors related to general discrimination, factor one dealt with discrimination related to housing (e.g. “Since the age of 18, how often were you prevented from moving into or buying a house or apartment by a landlord or realtor?”), factor two dealt with discrimination related to direct violence and employment (e.g. “Since the age of 18, how often have any of the following happened to you? Someone threw an object at you” and “Since the age of 18, how often were you fired from your job or denied a job?”), and factor three dealt with discrimination related to healthcare (e.g. “When seeking healthcare, I worry that diagnoses of me/my health may be negatively affected by my gender identity or sexual orientation”).

fitgen <- FA(discgen, nfactors=3, rotate="varimax")
plot(fitgen, sort=TRUE)

For the factors related to trans-specific discrimination, factor one dealt with discrimination related to a person’s gender being illegible (e.g. “People don’t respect my gender identity because of my appearance or body”), factor two dealt with discrimination related to internalized transphobia (e.g. “Because I am transgender, I feel like an outcast”), and factor three dealt with discrimination related to a person needing to hide that they are trans (e.g. “Because I don’t want others to know my gender identity/history, I modify my way of speaking”).

fittrans <- FA(disctrans, nfactors=3, rotate="varimax")
plot(fittrans, sort=TRUE)

After looking at group differences for all of the demographic variables for each factor, there were statistically significant differences between the response groups for discrimination related to housing based on visibility (p-value < .005) and sexual orientation (p-value < .005). For direct violence and employment, there were statistically significant differences between groups for disability (p-value < .005) and sexual orientation (p-value = .029).

FAgen <- df %>%
  score(fitgen) %>% 
  rename(visibility=q44)

groupdiff(FAgen, F1, visibility)
## $result
## [1] "F(4, 236) = 3.63, p-value = 0.0068"
## 
## $summarystats
##             visibility  n  mean   sd
## 1     (4) Occasionally 70 -0.14 0.30
## 2            (5) Never 72 -0.07 0.51
## 3 (2) Most of the time 30  0.06 0.86
## 4        (3) Sometimes 62  0.07 1.26
## 5           (1) Always  7  1.22 3.16
## 
## $plot

For the trans-specific factors, discrimination related to a person’s gender being illegible differs by disability (p-value < .005), gender identity (p-value < .005), gender presentation (p-value < .005), and visibility (p-value < .005). Discrimination related to a person needing to hide that they are trans differs based on visibility (p-value < .005).

FAtrans <- df %>%
  score(fittrans) 

ggplot(data=FAtrans,aes(x=sexualid,y=F2,fill=sexualid))+
  geom_boxplot()+
   theme(axis.text.x = element_text(angle = 45, 
                                   hjust = 1))

groupdiff(FAtrans, F1, trans)
## $result
## [1] "F(2, 238) = 25.98, p-value = 6.2e-11"
## 
## $summarystats
##                   trans   n  mean   sd
## 1   (1) Trans man (FTM)  64 -0.33 1.06
## 2 (2) Trans woman (MTF) 105 -0.23 0.91
## 3  (3) Gender nonbinary  72  0.63 0.65
## 
## $plot

It is important to note that some of these might be false positives. While there is theory to justify checking each of these individually (as well as anecdotal evidence to support the expectation that people with different identities within the trans community will experience different rates of discrimination), the sheer number of analyses run would all but guarantee at least one statistically significant finding.
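The size of this problem, and one common way to account for it, can be illustrated with base R. The p-values below are hypothetical and the count of tests is simply the length of that made-up vector; neither is taken from this analysis. The first calculation assumes the tests are independent and that no true group differences exist.

```r
# Hypothetical p-values from a family of exploratory tests (illustrative only).
p_raw <- c(0.0005, 0.004, 0.007, 0.03, 0.12, 0.35, 0.61)
m <- length(p_raw)

# Chance of at least one false positive at alpha = 0.05 across m independent
# tests when every null hypothesis is true: 1 - (1 - alpha)^m.
fwer <- 1 - (1 - 0.05)^m
cat("Tests:", m, " P(at least one false positive):", round(fwer, 3), "\n")

# Holm controls the family-wise error rate; BH controls the false discovery rate.
p_holm <- p.adjust(p_raw, method = "holm")
p_bh   <- p.adjust(p_raw, method = "BH")

print(data.frame(p_raw, p_holm = round(p_holm, 4), p_bh = round(p_bh, 4)))
```

Adjusted p-values are never smaller than the raw ones, so a finding that remains below the threshold after adjustment is less likely to be an artifact of running many tests.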

Clustering

Clustering was done using partitioning around medoids. Four solutions were attempted. For each, a silhouette plot determined that the appropriate number of clusters was two. These solutions consisted of clustering around all of the demographic information available, experiences with discrimination, all of the demographic information and experiences with discrimination, and finally demographic information that was considered most likely to reveal clusters based on previous research. The intent was to find these clusters, then add them back to the full dataset to compare how each cluster differed in demographic characteristics and experiences with discrimination, adjusting the interpretation based on how the clusters were formed. However, all of these attempts failed to produce distinct clusters based on the t-SNE graphs. In all of the attempts, the clusters were nearly identical in their demographic makeup, except that one cluster was consistently predominantly trans women and the other cluster was predominantly trans men and nonbinary people, and typically the cluster with trans men and nonbinary people had more people who identified as queer. It is difficult to say much about this. While there is some reason to believe the experiences of trans men and nonbinary people might be more likely to be similar to one another (for example, they both experience relatively high rates of sexual assault and both struggle with invisibility while trans women struggle with hypervisibility), it is unclear if these findings are legitimate given that these clusters are not well separated.

ggplotly(p)

Below you can see the results from the clustering done on targeted demographic information, which consisted of the area of the country a person lived in, age, race, sexual orientation, gender identity, personal income, education level, whether they were visibly trans, and presentation on a scale from masculine to feminine. There appear to be some differences in experiences with discrimination, and one cluster consists predominantly of trans women who present as feminine but is otherwise comparable to the other cluster.

crosstab(df, cluster, gcendiv, type="rowpercent", plot=TRUE) 

crosstab(df, cluster, geducation, type="rowpercent", plot=TRUE) 

crosstab(df, cluster, q44, type="rowpercent", plot=TRUE) 

crosstab(df, cluster, q42, type="rowpercent", plot=TRUE) 

crosstab(df, cluster, trans, type="rowpercent", plot=TRUE) 

crosstab(df, cluster, race, type="rowpercent", plot=TRUE) 

crosstab(df, cluster, sexualid, type="rowpercent", plot=TRUE) 

crosstab(df, cluster, pinc_i, type="rowpercent", plot=TRUE) 

df %>%
  standardize() %>%
  select(where(is.numeric) | cluster) %>%
  profile_plot()

Discussion

This analysis was not able to find meaningful clusters within the trans community based on the data available. This is worth noting, because while it might be due to the relatively small sample size, it may also indicate that the trans community is more heterogeneous than previously expected. This study was also not able to reduce the high dimensionality of the data using PCA. However, it was able to find six factors related to experiences with discrimination after dividing the relevant variables into those relevant to general discrimination and those relevant to trans-specific discrimination. This study highlights that any research or policies designed to study or address discrimination against trans people must take into consideration that the trans community is not homogeneous and that their experiences with discrimination are varied. One vital area for future study is looking directly at this heterogeneity within the trans community and how it affects trans people’s experiences with discrimination, using a survey or other methods explicitly designed to do so. The TransPop survey does not include the sort of questions necessary to determine to what extent factors like racism or ableism might affect trans people’s experiences with transphobia, and these could affect the efficacy of different policies as well as complicate any findings of analyses that do not account for them.

There are some limitations to this analysis. As noted before, the TransPop survey was not designed for attempting to study the differences in how trans people experience discrimination based on demographic factors. There were only 274 participants, and only 241 were included in the analysis after deleting observations with missing values.

Additionally, it is vital to note that this survey was conducted in 2016-2018. The trans community is rapidly changing due to the increased visibility of trans people. As trans people become more visible, more people realize they are trans, shifting the demographics of the trans community. This can be seen for example in the shift from the trans community being predominantly trans women (Arcelus, 2015) to now consisting of a nearly equal number of trans men and trans women among binary trans people (Leinung, 2020). This increased visibility has also come with changes in the forms of discrimination trans people face. There has been a recent increase in transphobic legislation, perhaps most infamously the bathroom bills, and in many areas it has become harder to pass because of this increased awareness of the existence of trans people. Ultimately this study can provide a snapshot of a particular time in the trans community, and a starting place for much needed further research.

References

Arcelus, J. “Systematic Review and Meta-Analysis of Prevalence Studies in Transsexualism.” European Psychiatry. 30, no. 6 (2015): 807–15. https://doi.org/10.1016/j.eurpsy.2015.04.005.

Bakko, Matthew, and Shanna K Kattari. “Transgender-Related Insurance Denials as Barriers to Transgender Healthcare: Differences in Experience by Insurance Type.” Journal of General Internal Medicine: JGIM 35, no. 6 (2020): 1693–1700. https://doi.org/10.1007/s11606-020-05724-2.

Downing, Jae, Kendall A Lawley, and Alex McDowell. “Prevalence of Private and Public Health Insurance Among Transgender and Gender Diverse Adults.” Medical Care 60, no. 4 (2022): 311–15. https://doi.org/10.1097/MLR.0000000000001693.

Harrison J, Grant J, Herman JL. A gender not listed here: genderqueers, gender rebels, and otherwise in the national transgender discrimination survey. LGBTQ Public Policy Journal at the Harvard Kennedy School. 2012;2(1):13–24.

James, S.E., Herman, J.L., Rankin, S., Keisling, M., Mottet, L., Anafi, M. The Report of the 2015 U.S. Transgender Survey. Washington, DC: National Center for Transgender Equality.

Matsuno, E., Budge, S.L. Non-binary/Genderqueer Identities: a Critical Review of the Literature. Curr Sex Health Rep 9, 116–120 (2017). https://doi.org/10.1007/s11930-017-0111-8

Meyer, Ilan H. TransPop, United States, 2016-2018. Inter-university Consortium for Political and Social Research [distributor], 2021-06-23. https://doi.org/10.3886/ICPSR37938.v1

Motmans, Joz, Petra Meier, Koen Ponnet, and Guy T’Sjoen. “Female and Male Transgender Quality of Life: Socioeconomic and Medical Differences.” Journal of Sexual Medicine 9, no. 3 (2012): 743–50. https://doi.org/10.1111/j.1743-6109.2011.02569.x.

Tatum, A. K., Catalpa, J., Bradford, N. J., Kovic, A., & Berg, D. R. (2020). Examining identity development and transition differences among binary transgender and genderqueer nonbinary (GQNB) individuals. Psychology of Sexual Orientation and Gender Diversity, 7(4), 379–385. https://doi.org/10.1037/sgd0000377

Veale J, Byrne J, Tan K, Guy S, Yee A, Nopera T & Bentham R (2019) Counting Ourselves: The health and wellbeing of trans and non-binary people in Aotearoa New Zealand. Transgender Health Research Lab, University of Waikato: Hamilton NZ.

Waldman, H Barry. “Transgender People with Disabilities.” NYS Dental Journal 84, no. 2 (2018).